Abstract



Alternative approaches are discussed for use of e-rater® to score the TOEFL iBT® Writing test. 
These approaches involve alternate criteria. In the 1st approach, the predicted variable is the 
expected rater score of the examinee’s 2 essays. In the 2nd approach, the predicted variable is the 
expected rater score of 2 essay responses by the examinee on a parallel form. This 2nd approach 
is related to prediction of the expected rater score of 2 essay responses on an actual form taken 
later by an examinee. The relationship of e-rater scores to scores on other sections of TOEFL® is 
also considered. These alternative approaches suggest somewhat different procedures for scoring. 

In the TOEFL iBT® Writing test, e-rater® is currently used in scoring of the two writing
tasks. In this report, criteria are considered for evaluation of the current scoring procedure and 
for selection of other scoring procedures that make use of e-rater. In section 1, the criterion of 
scoring accuracy is considered. In section 2, behavior of e-rater and human scores on repeat 
examinations is explored. In section 3, analysis from sections 1 and 2 is applied to provide an 
analysis based on reliability measurement. In section 4, the relationship of e-rater and human 
scores to other portions of the TOEFL iBT test is examined. In section 5, some conclusions are 
provided. Analysis is somewhat restricted because only two prompts are available in the Writing 
assessment for a given administration. In addition, the data are of quite variable quality, and the 
results are quite dependent on the criteria used. 

1 Scoring Accuracy 

In the TOEFL iBT examination, writing is assessed by use of two tasks: an independent task 
in which an opinion must be supported in writing and an integrated task in which an examinee 
must respond to both reading material and spoken material. At the introduction of the TOEFL 
iBT examination, two human raters normally scored each task, and each rater assigned a holistic 
score of 1 to 5 to the response. Some exceptions to this situation arose in such cases as off-topic 
responses, blank responses, and scores assigned to the same response that differed by more 
than one point. During 2009, e-rater and a human rater were typically employed to score the 
independent task, although some exceptions arose due to essays not appropriate for e-rater or due 
to large discrepancies between e-rater and human scores. Beginning late in 2010, both in the case 
of the independent task and in the case of the integrated task, e-rater and a human rater have 
typically been employed to score the response. 

To date, e-rater has been treated in the TOEFL® test as a substitute score for a human 
score, save that e-rater produces a continuous score and human raters produce only integer scores. 
In this report, e-rater will be regarded as a predictor of a human score rather than as a substitute. 
This approach can yield somewhat different results in terms of weighting of human and e-rater 
scores and in terms of definitions of unusual discrepancy. The basic data in this part of the 
analysis are obtained from a sample of 139,134 TOEFL examinees for whom two human scores 
from 1 to 5 and an e-rater score were available for the responses to both the independent and 
integrated tasks. The data are from administrations in the first 10 months of 2008. They are not 
a random sample of examinees, for the number of responses per prompt is nearly constant but the
number of examinees in an administration is quite variable. This procedure can be expected to 
overweight examinees in Western countries and to underweight examinees in Eastern countries. 

In addition, the data do not include several months in 2008 in which examinee performance is 
typically relatively high. A further bias results from the fact that the administrations with highest 
volumes are typically in times of the year that are relatively less represented in the sample. In 
short, any analysis based on these data should be approached with the caution appropriate for a 
biased sample. 

In the analysis in this report, the e-rater scores for each task are derived from a linear
regression using the generic-model approach on the sampled data from 2008 (Attali & Burstein, 
2005). For each task, a linear transformation is applied to e-rater scores to match the sample mean 
and sample variance of the average of two human ratings for the sample of responses used in the 
regression analysis. In current practice, e-rater scores are truncated so that no score is permitted 
to be less than 0.5001 or greater than 5.4999. This truncation affects a relatively small fraction of 
the sample, about 0.5% of examinees for the independent task and about 3.76% of examinees for the
integrated task. For each task, about 80% of truncations arise because the e-rater score is less than
0.5001 without truncation. In this report, both e-rater scores with truncation and e-rater scores 
without truncation are considered. 
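
The scaling and truncation steps described above can be sketched as follows. This is a minimal illustration and not the operational e-rater code: the function names `rescale` and `truncate` and the sample values are hypothetical, and population standard deviations are assumed when matching variances.

```python
from statistics import mean, pstdev

LOWER, UPPER = 0.5001, 5.4999  # truncation bounds quoted in the text


def rescale(raw_scores, target_mean, target_sd):
    """Linearly transform raw_scores so they have the target mean and SD."""
    m, s = mean(raw_scores), pstdev(raw_scores)
    return [target_mean + target_sd * (x - m) / s for x in raw_scores]


def truncate(score, lower=LOWER, upper=UPPER):
    """Clamp a scaled e-rater score to the permitted range."""
    return min(max(score, lower), upper)


# Hypothetical raw regression outputs and hypothetical human-average targets.
raw = [-1.0, 2.0, 3.0, 4.0, 7.0]
scaled = rescale(raw, target_mean=3.0, target_sd=2.0)
truncated = [truncate(x) for x in scaled]
```

In this toy input only the two extreme values fall outside the bounds, so only they are changed by truncation.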

1.1 Human Scoring 

To begin analysis, it is appropriate to summarize some basic features of the human scoring 
for the sample under study. For the integrated task, the average of human scores is 3.09, and the 
average score for the independent task is 3.39. The tasks differ substantially in terms of variability. 
For the integrated task, the sample standard deviation of a human rater is 1.19, and the sample 
standard deviation of a human rater is 0.83 for the independent task. Prior to use of e-rater, the 
raw score $S_r$ for the Writing test was normally the sum S of the average human rating for the
integrated task and the average human rating for the independent task, although exceptions arose
when two human scores on the same response differed by more than 1. The scaled Writing score
was then obtained by rounding a linear transformation of the raw score to the nearest integer
within the range 0 to 30. The sample correlation of S and $S_r$ is 0.996, and S and $S_r$ differ for only
about 2.2% of the examinees in the sample. As a consequence, it appears reasonable to apply S
in subsequent analysis. This result is consistent with earlier work at ETS in the 1980s (Mazzeo,
Schmitt, & Cook, 1986a, 1986b). Use of S has the advantage that linear theory is readily applied.

Both $S_r$ and S have sample mean 6.47 and standard deviation 1.71. The average of the two
human scores for the integrated task has sample standard deviation 1.14, and the average of the
two human scores for the independent task has sample standard deviation 0.76. The sample
correlation of the average human score for the integrated task and the average human score on the
independent task is 0.61. In the case of two items with unequal sample variances, the estimated
Cronbach $\alpha$ (Lord & Novick, 1968, p. 204) of the score S is

$$\alpha_S = \frac{4(1.14)(0.76)(0.61)}{1.71^2} = 0.72.$$

Thus reliability of the Writing test is an obvious cause for concern. Nonetheless, as is well known,
$\alpha$ provides a lower bound to reliability, so that the actual reliability may be higher. Some further
work on estimation of this reliability will be considered in section 2.
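
The two-item calculation above can be checked with a short sketch. The function name `cronbach_alpha_two_items` is hypothetical, and the inputs are the rounded summary statistics quoted in the text, so the result is only as precise as those figures.

```python
def cronbach_alpha_two_items(sd1, sd2, corr):
    """Cronbach alpha for a two-item sum with possibly unequal item variances."""
    cov = sd1 * sd2 * corr                      # covariance of the two items
    total_var = sd1 ** 2 + sd2 ** 2 + 2 * cov   # variance of the sum
    # For k = 2 items, alpha = 2 * (1 - sum of item variances / total variance),
    # which simplifies to 4 * cov / total_var.
    return 4 * cov / total_var, total_var


alpha_s, var_s = cronbach_alpha_two_items(sd1=1.14, sd2=0.76, corr=0.61)
sd_s = var_s ** 0.5
```

With these inputs the implied standard deviation of the sum rounds to 1.71 and the coefficient rounds to 0.72, in agreement with the values in the text.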

An appreciable fraction of the error of measurement of the raw score S is clearly due to the
human rating, although the reliability of S would not be very high even if variability of human
raters were to disappear. For a pair of examinee responses, let U be the expected value of S obtained
by regarding the human raters as randomly drawn from the pool of raters used in scoring. The
estimated Cronbach $\alpha$ of S implies that the estimated variance of measurement of S is no greater
than $0.83 = 1.71^2 (1 - 0.72)$. For either of the two tasks, the estimated variance of scoring error is
one quarter of the mean squared difference of the two corresponding rater scores. The estimated
variances are 0.12 for the integrated task and 0.11 for the independent task. The estimated
variance of scoring error for S is 0.24 (note rounding error), the sum of the estimated variances of
scoring error for the two tasks, so that the estimated variance of measurement of U is no greater
than $0.83 - 0.24 = 0.59$, the estimated variance of U is $1.71^2 - 0.24 = 2.70$, and the Cronbach $\alpha$ of
U is still only $\alpha_U = 1 - 0.59/2.70 = 0.78$.
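
The scoring-error estimate and the arithmetic for U can be sketched as follows. The function name `scoring_error_variance` and the four rating pairs are hypothetical; the remaining figures are the rounded estimates quoted in the text rather than values recomputed from the sample.

```python
def scoring_error_variance(first_ratings, second_ratings):
    """Scoring-error variance of the average of two ratings.

    If each rating has independent error variance v, the squared difference of
    two ratings has mean 2v and the average of two ratings has error variance
    v / 2, that is, one quarter of the mean squared difference.
    """
    squared = [(a - b) ** 2 for a, b in zip(first_ratings, second_ratings)]
    return sum(squared) / len(squared) / 4


# Hypothetical pairs of ratings for one task.
first = [3, 4, 2, 5]
second = [3, 3, 2, 4]
task_error_variance = scoring_error_variance(first, second)

# Rounded estimates quoted in the text.
measurement_variance_s = 0.83  # upper bound for S
scoring_variance_s = 0.24      # sum over the two tasks
variance_u = 2.70

measurement_variance_u = measurement_variance_s - scoring_variance_s
alpha_u = 1 - measurement_variance_u / variance_u
```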

1.2 Prediction of Human Scoring by e-rater 

By Kelley’s formula (Kelley, 1947), the best linear predictor of U by S is estimated to be
$6.47 + R^2 (S - 6.47)$, where the estimated coefficient of determination $R^2$ is $1 - 0.24/2.70 = 0.91$.
A basic question involving automated scoring is how well U can be predicted by one human score
$h_{g1}$ on the integrated prompt, one human score $h_{d1}$ on the independent prompt, the e-rater score
$e_g$ on the integrated prompt, and the e-rater score $e_d$ on the independent prompt. This question is
closely related to a similar issue that has been studied for a single essay score (Haberman & Qian,
2007). The same methodology leads to the following prediction for U:

$$\hat{U} = 0.74 + 0.79 h_{g1} + 0.16 e_g + 0.36 h_{d1} + 0.47 e_d.$$

In this prediction formula, a best linear predictor of U from $h_{g1}$, $h_{d1}$, $e_g$, and $e_d$ is constructed.
The covariance matrix of the predictors is readily estimated from the available sample. The
covariance of U and $e_g$ is the same as the covariance of S and $e_g$ because the e-rater score
does not involve the rater error, and the covariance of S and $e_g$ is readily estimated from the
sample data. Similarly, the covariance of U and $e_d$ can be estimated without difficulty. Let $h_{g1}$
be decomposed into the sum $H_g + r_{g1}$, where the scoring error $r_{g1}$ has mean 0 and is uncorrelated
with $H_g$. Let $h_{d1}$ be decomposed into the sum $H_d + r_{d1}$, where the scoring error $r_{d1}$ has mean 0
and is uncorrelated with $H_g$, $H_d$, and $r_{g1}$. Then the covariance of U and $h_{g1}$ is the sum of the
variance of $H_g$ and the covariance of $h_{g1}$ and $h_{d1}$. The covariance of U and $h_{d1}$ is the sum of the
variance of $H_d$ and the covariance of $h_{g1}$ and $h_{d1}$. The covariance of $h_{g1}$ and $h_{d1}$ can be estimated
easily from the sample data. The variance of $H_g$ is the variance of $h_{g1}$, which is estimated from
the sample data, minus the variance of $r_{g1}$, which is twice the estimated variance of scoring error
for the integrated task. A similar argument applies to $H_d$.
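
The covariance construction above can be illustrated for the simplest case, in which only the two human scores are used as predictors. The sketch below is an assumption-laden illustration: the helper names `solve` and `best_linear_predictor` are hypothetical, the inputs are the rounded summary statistics from section 1.1 rather than the full sample, and the covariance of the two single human scores is taken to equal the covariance of the two task averages, which holds when rater errors on different tasks are uncorrelated.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def best_linear_predictor(cov_xx, cov_xu, mean_x, mean_u):
    """Return (intercept, weights) of the best linear predictor of U from x."""
    weights = solve(cov_xx, cov_xu)
    intercept = mean_u - sum(w * m for w, m in zip(weights, mean_x))
    return intercept, weights


# Rounded summary statistics quoted in section 1.1.
var_h_g1 = 1.19 ** 2          # variance of one human score, integrated task
var_h_d1 = 0.83 ** 2          # variance of one human score, independent task
cov_h = 1.14 * 0.76 * 0.61    # covariance of the two human scores (assumption above)
var_r_g1 = 2 * 0.12           # twice the scoring-error variance, integrated task
var_r_d1 = 2 * 0.11           # twice the scoring-error variance, independent task

var_H_g = var_h_g1 - var_r_g1
var_H_d = var_h_d1 - var_r_d1

cov_xx = [[var_h_g1, cov_h], [cov_h, var_h_d1]]
cov_xu = [var_H_g + cov_h, var_H_d + cov_h]  # covariances of U with each score

intercept, weights = best_linear_predictor(
    cov_xx, cov_xu, mean_x=[3.09, 3.39], mean_u=6.47
)
```

With these rounded inputs the weights come out near 0.93 and 0.73, close to the coefficients reported later in this section for the predictor based on human scores alone; exact agreement should not be expected from rounded inputs.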

Given the existing estimate of the variance of U, one finds that the coefficient of determination
$R^2$ is 0.83. By this criterion, prediction of U by $h_{g1}$, $e_g$, $h_{d1}$, and $e_d$ rather than S does entail an
appreciable decrease in $R^2$.

The prediction of U is almost unaffected if e-rater is not considered at all for the integrated
prompt. In this case, the prediction for U is

$$\hat{U}_1 = 0.59 + 0.83 h_{g1} + 0.38 h_{d1} + 0.61 e_d.$$

The resulting $R^2$ is 0.82.

If e-rater scores and human scores are used interchangeably, so that U is predicted by a linear
function of

$$Q = (h_{g1} + e_g + h_{d1} + e_d)/2,$$

then the predictor is

$$\hat{U}_2 = 0.75 + 0.88 Q$$

and $R^2$ decreases to 0.79. If human scores are used but no e-rater scores are employed, then the
predictor is

$$\hat{U}_3 = 1.15 + 0.93 h_{g1} + 0.72 h_{d1}$$

and $R^2$ is also 0.79. A simple alternative predictor is a linear function of

$$Q_1 = h_{g1} + (h_{d1} + e_d)/2.$$

In this case, the predictor is

$$\hat{U}_4 = 0.75 + 0.88 Q_1$$

and $R^2$ is 0.82.
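
As a small worked illustration of the two simple composites above, the following sketch evaluates the equally weighted composite Q and the composite $Q_1$, together with the corresponding linear predictors, for one examinee. The function names and the score profile are hypothetical, and the intercepts and slopes are the rounded values reported in the text.

```python
def q_equal(h_g1, e_g, h_d1, e_d):
    """Composite treating human and e-rater scores interchangeably."""
    return (h_g1 + e_g + h_d1 + e_d) / 2


def q_1(h_g1, h_d1, e_d):
    """Composite using e-rater only for the independent prompt."""
    return h_g1 + (h_d1 + e_d) / 2


def predict_from_composite(composite, intercept=0.75, slope=0.88):
    """Linear predictor of U from a composite (same rounded values for both)."""
    return intercept + slope * composite


# Hypothetical examinee: human 3 and e-rater 3.5 on the integrated task,
# human 4 and e-rater 3.5 on the independent task.
q = q_equal(h_g1=3, e_g=3.5, h_d1=4, e_d=3.5)
q1 = q_1(h_g1=3, h_d1=4, e_d=3.5)
u_hat_2 = predict_from_composite(q)
u_hat_4 = predict_from_composite(q1)
```

Both composites are on the scale of S, the sum of two task scores, which is why each divides the pooled human and e-rater scores for a task by 2.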

During meetings in 2010 of the Technical Advisory Committee on Automatic Scoring, other
alternative predictors with simple weights were suggested. These include the following:

$$Q_2 = \frac{2 h_{g1} + e_g}{3} + \frac{h_{d1} + e_d}{2},$$

$$Q_3 = \frac{2 h_{g1} + e_g + h_{d1} + e_d}{2.5},$$

and

$$Q_4 = \frac{2 h_{g1} + e_g + h_{d1} + 2 e_d}{3}.$$

One has the predictor

$$\hat{U}_5 = 0.63 + 0.90 Q_2$$

with $R^2 = 0.81$, the predictor

$$\hat{U}_6 = 0.92 + 0.87 Q_3$$

with $R^2 = 0.82$, and the predictor

$$\hat{U}_7 = 0.66 + 0.90 Q_4$$

with $R^2 = 0.81$.

The current use of e-rater for scoring both TOEFL prompts is not exactly a linear function of
the human scores $h_{g1}$ and $h_{d1}$ and the e-rater scores $e_g$ and $e_d$. As already noted, the e-rater scores
are truncated. In addition, sufficiently large discrepancies between e-rater and corresponding
human scores lead to additional use of human raters. The resulting approximation to S will
be written as V. The lack of linearity prevents use of the analytical methods in this section.
Nonetheless, the functions S, $\hat{U}$, $\hat{U}_1$, $\hat{U}_2$, $\hat{U}_3$, $\hat{U}_4$, $\hat{U}_5$, $\hat{U}_6$, $\hat{U}_7$, and V used in prediction of U are all
quite highly correlated, as is evident from Table 1. The relationship of $\hat{U}_2$ to S is relatively weaker
than is the case for the other estimates.



Table 1

Correlations of Predictors of Score U

Predictor    S            $\hat{U}$    $\hat{U}_1$  $\hat{U}_2$  $\hat{U}_3$  $\hat{U}_4$  $\hat{U}_5$  $\hat{U}_6$  $\hat{U}_7$  V
S            1.00         0.96         0.96         0.93         0.96         0.96         0.95         0.95         0.94         0.95
$\hat{U}$    0.96         1.00         1.00         0.98         0.98         1.00         0.99         1.00         0.99         0.98
$\hat{U}_1$  0.96         1.00         1.00         0.98         0.98         0.99         0.98         0.99         0.98         0.97
$\hat{U}_2$  0.93         0.98         0.98         1.00         0.92         0.95         0.99         0.99         0.99         0.98
$\hat{U}_3$  0.96         0.98         0.98         0.92         1.00         0.99         0.95         0.96         0.94         0.95
$\hat{U}_4$  0.96         1.00         0.99         0.95         0.99         1.00         0.98         0.98         0.98         0.97
$\hat{U}_5$  0.95         0.99         0.98         0.99         0.95         0.98         1.00         1.00         1.00         0.99
$\hat{U}_6$  0.95         1.00         0.99         0.99         0.96         0.98         1.00         1.00         1.00         0.98
$\hat{U}_7$  0.94         0.99         0.98         0.99         0.94         0.98         1.00         1.00         1.00         0.98
V            0.95         0.98         0.97         0.98         0.95         0.97         0.99         0.98         0.98         1.00


By the criterion of prediction of human scoring, it follows that e-rater has modest utility and 
nearly all value of e-rater is provided by e-rater for the independent prompt. 

2 Analysis of Repeaters 

Data from the TOEFL examinees included 7,747 examinees who repeated the TOEFL 
examination and had two human scores from 1 to 5 and e-rater scores for both Writing prompts for 
both administrations studied. These data are obviously rather biased given that most examinees 
do not repeat the TOEFL examination. As evident from Table 2, the distribution of test country 
in the sample of repeaters is very different than the distribution for the main sample. 



Table 2

Distribution of Examinees by Test Country

                 Percent of sample
Country          Main      Repeater
China            12.0       8.0
India             7.8       2.5
Japan             6.9      13.8
South Korea      13.4      26.6
United States    18.1      24.1
Other            41.8      25.1



In addition, the distribution of examinee scores is somewhat different in the repeater sample. 
In the main sample, S has a mean of 6.47, a standard deviation of 1.71, and a Cronbach $\alpha$ of
0.72. In the repeater sample, for the first administration for an examinee, the mean of S is 6.02,
the standard deviation is 1.56, and $\alpha$ is 0.66. For the second administration, S has mean 6.35,
standard deviation 1.52, and $\alpha$ of 0.65. In view of the bias of the sample, considerable caution
must be used in application of the data.

To begin, consider prediction of S for the second TOEFL test of the examinee from S, $\hat{U}$, $\hat{U}_1$,
$\hat{U}_2$, $\hat{U}_3$, $\hat{U}_4$, and V from the first TOEFL test. Results are summarized in Table 3.



Table 3

Correlations of TOEFL Scores on a Repeat Administration
to TOEFL Scores on an Initial Administration

Predictor at first     S at second administration
administration         $R$       $R^2$
S                      0.73      0.53
$\hat{U}$              0.73      0.54
$\hat{U}_1$            0.73      0.53
$\hat{U}_2$            0.73      0.54
$\hat{U}_3$            0.69      0.48
$\hat{U}_4$            0.72      0.52
$\hat{U}_5$            0.74      0.55
$\hat{U}_6$            0.73      0.54
$\hat{U}_7$            0.74      0.55
V                      0.74      0.54



These results suggest that, with the exception of $\hat{U}_3$, the predictors are all quite comparable
in terms of prediction of S at the second administration. The extent to which sample bias affects 
the results remains an important question. 



3 Reliability Analysis 

In typical applications of augmentation, sample means, sample variances, sample covariances,
and estimated Cronbach $\alpha$ values of components of a composite score are used to examine
appropriate linear weighting of the observed components in order to estimate a true score of one
or more test components (Haberman, 2008; Wainer et al., 2001). Best linear predictors are used.
In the case under study, complications arise because each test component includes only one item,
so that a Cronbach $\alpha$ cannot be estimated for each test component. In this section, an attempt at
augmentation analysis is made by use of some data on repeaters in order to estimate reliability
of each item score. Because the repeater analysis involves sample bias, estimation of reliability of
tasks is obtained with minimal use of the information based on repeater data. The analysis in this
section is somewhat different from the analysis in Section 1, for errors due to examinee variation
on item responses are considered along with variation due to scoring error.

The following decompositions divide scores into true scores, errors exclusive of rater errors,
and rater errors:

$$h_{gj} = H_{Tg} + H_{Eg} + r_{gj}, \quad 1 \le j \le 2,$$
$$h_{dj} = H_{Td} + H_{Ed} + r_{dj}, \quad 1 \le j \le 2,$$
$$e_g = e_{Tg} + e_{Eg},$$
$$e_d = e_{Td} + e_{Ed}.$$

The true scores $H_{Tg}$, $H_{Td}$, $e_{Tg}$, and $e_{Td}$ are uncorrelated with the errors $H_{Eg}$, $r_{gj}$, $H_{Ed}$, $r_{dj}$,
$e_{Eg}$, and $e_{Ed}$. Errors from different prompts are uncorrelated, so that $H_{Eg}$, $r_{gj}$, and $e_{Eg}$ are not
correlated with $H_{Ed}$, $r_{dj}$, and $e_{Ed}$. In addition, the rater errors $r_{g1}$ and $r_{g2}$ are uncorrelated with
each other and with $H_{Eg}$ and $e_{Eg}$, and the rater errors $r_{d1}$ and $r_{d2}$ are uncorrelated with each
other and with $H_{Ed}$ and $e_{Ed}$. The expected value of each error component is 0, the variances of $r_{g1}$
and $r_{g2}$ are both $\sigma^2_{rg}$, and the variances of $r_{d1}$ and $r_{d2}$ are both $\sigma^2_{rd}$. The variance of $H_{Tg}$ is $\sigma^2_{HTg}$,
the variance of $H_{Ed}$ is $\sigma^2_{HEd}$, and similar notation is used for other variances. The covariance of
$H_{Tg}$ and $H_{Td}$ is $\gamma_{THHgd}$, the covariance of $e_{Tg}$ and $e_{Td}$ is $\gamma_{TEEgd}$, and similar conventions are
applied to other covariances. The true sum

$$W = H_{Tg} + H_{Td}$$

is to be estimated by use of the human scores $h_{g1}$ and $h_{d1}$ and the e-rater scores $e_g$ and $e_d$. Note
that U in Section 1 is $W + H_{Eg} + H_{Ed}$. In the analysis of the main sample, it is a straightforward
matter to estimate the covariance matrix of the vector with elements $h_{g1}$, $h_{d1}$, $e_g$, and $e_d$ and to
estimate the variances $\sigma^2_{rg}$ and $\sigma^2_{rd}$. On the one hand, the methods of this section also demand
estimation of the covariance matrix of the vector with elements $H_{Tg}$, $H_{Td}$, $e_{Tg}$, and $e_{Td}$. In the
latter case, estimation of $\gamma_{THHgd}$, $\gamma_{THEgd}$, $\gamma_{TEHgd}$, and $\gamma_{TEEgd}$ is readily accomplished with the
complete sample. For example, $\gamma_{THHgd}$ is also the covariance of $h_{g1}$ and $h_{d1}$. On the other hand,
other elements of the covariance matrix of the true scores cannot be obtained from conventional
analysis without very strong assumptions.




Nonetheless, the repeater data can be employed to obtain plausible estimates of the covariance
matrix of the true scores. Consider the case of $\gamma_{THEgg}$. The repeater data provide an estimate
$\hat{\gamma}_{RTHEgg}$ equal to the average of the sample covariances of $h_{gj}$ from the first administration and
$e_g$ from the second administration and $e_g$ from the first administration and $h_{gj}$ from the second
administration for j equal 1 or 2. Because bias is a concern, it is probably prudent also to consider
an estimate $\hat{\gamma}_{RHEgg}$ based on the average of the sample covariances of $h_{gj}$ and $e_g$ for j equal 1 or 2
for the same administration. Let $\hat{\gamma}_{HEgg}$ be the average of the sample covariances from the main
sample of $h_{gj}$ and $e_g$. Then the estimate of $\gamma_{THEgg}$ is

$$\hat{\gamma}_{THEgg} = \hat{\gamma}_{RTHEgg} \hat{\gamma}_{HEgg} / \hat{\gamma}_{RHEgg}.$$

This estimate is appropriate if the ratio $\gamma_{THEgg} / \gamma_{HEgg}$ between covariances of true scores $H_{Tg}$
and $e_{Tg}$ and covariances of observed scores $h_{g1}$ and $e_g$ is the same as the corresponding ratio
of covariances conditional on being from the repeater population. It is not assumed that the
covariance of $h_{g1}$ and $e_g$ is the same for the complete and repeater populations. Indeed it is clear
from the data that these covariances are different. To be sure, no way exists to be certain that the
assumption used to derive $\hat{\gamma}_{THEgg}$ is actually valid, but the assumption is at least more limited
than the assumption of equal covariances for complete and repeater populations.
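
The ratio adjustment described above can be sketched in a few lines. The function and argument names are hypothetical, and the numeric inputs are made up for illustration; they are not estimates from the TOEFL data.

```python
def sample_covariance(x, y):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def true_score_covariance(cross_admin_cov_repeaters,
                          same_admin_cov_repeaters,
                          same_admin_cov_main):
    """Scale the main-sample covariance by the repeater true/observed ratio.

    cross_admin_cov_repeaters: covariance of human and e-rater scores taken
        from different administrations (repeater sample).
    same_admin_cov_repeaters: covariance of human and e-rater scores from the
        same administration (repeater sample).
    same_admin_cov_main: covariance of human and e-rater scores in the main
        sample.
    """
    ratio = cross_admin_cov_repeaters / same_admin_cov_repeaters
    return ratio * same_admin_cov_main


# Made-up covariances, for illustration only.
estimate = true_score_covariance(
    cross_admin_cov_repeaters=0.6,
    same_admin_cov_repeaters=0.8,
    same_admin_cov_main=1.0,
)
```

Only the ratio of the two repeater covariances enters the estimate, which is why the approach does not require the repeater and main samples to have equal covariances.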

Similar arguments can be used to estimate all needed variances and covariances of true scores.
For this analysis, the optimal prediction of W is

$$\hat{W} = 1.42 + 0.44 h_{g1} + 0.27 e_g + 0.27 h_{d1} + 0.58 e_d.$$

The resulting $R^2$ is 0.79. For prediction of U rather than W, a linear function of $\hat{W}$ can be
obtained with an $R^2$ of 0.80, so that $\hat{W}$ is a bit less effective than $\hat{U}$ as a predictor of U.

Some comparisons with alternative estimates are worth consideration. Consider estimation
by just the human scores $h_{g1}$ and $h_{d1}$ and the e-rater score $e_d$ for the independent prompt. In this
case, the optimal prediction of W is

$$\hat{W}_1 = 1.16 + 0.51 h_{g1} + 0.30 h_{d1} + 0.80 e_d.$$

The $R^2$ is 0.77. If just two human scores are used, then the optimal prediction of W is

$$\hat{W}_2 = 1.91 + 0.65 h_{g1} + 0.76 h_{d1}.$$

The $R^2$ is 0.69.

The optimal linear function of the equally weighted score Q is

$$\hat{W}_3 = 1.54 + 0.76 Q,$$

and the resulting $R^2$ is 0.79. The optimal linear function of S is

$$\hat{W}_4 = 1.65 + 0.74 S,$$

and $R^2$ is 0.74.

If a linear function of $\hat{U}$ is employed, then the optimal predictor is

$$\hat{W}_5 = 1.70 + 0.83 \hat{U},$$

and $R^2$ is 0.77.

If a linear function of $Q_1$ is employed, then the optimal predictor is

$$\hat{W}_6 = 1.78 + 0.72 Q_1,$$

and $R^2$ is 0.74.

If a linear function of $Q_2$ is employed, then the optimal predictor is

$$\hat{W}_7 = 1.52 + 0.77 Q_2,$$

and $R^2$ is 0.79.

If a linear function of $Q_3$ is employed, then the optimal predictor is

$$\hat{W}_8 = 1.80 + 0.73 Q_3,$$

and $R^2$ is 0.78.

If a linear function of $Q_4$ is employed, then the optimal predictor is

$$\hat{W}_9 = 1.52 + 0.77 Q_4,$$

and $R^2$ is 0.79.

The reliability analysis thus suggests different weights than suggested by the analysis of
scoring accuracy in Section 1. One issue of note here is that e-rater scores are much more reliable
than are human scores. Recall that S has a Cronbach $\alpha$ of 0.72. In contrast, from the complete
data, one finds that an assessment score that is a weighted linear combination of the two e-rater
scores can have a Cronbach $\alpha$ as high as 0.86.
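
The difference in weighting can be made concrete by comparing the two optimal four-score predictors. In the sketch below, the coefficients are the rounded values as read from Sections 1 and 3, the dictionary layout and function names are hypothetical, and the examinee profile is made up.

```python
# (intercept, weights) for the scoring-accuracy predictor of U and the
# reliability-based predictor of W, using the rounded reported coefficients.
U_HAT = (0.74, {"h_g1": 0.79, "e_g": 0.16, "h_d1": 0.36, "e_d": 0.47})
W_HAT = (1.42, {"h_g1": 0.44, "e_g": 0.27, "h_d1": 0.27, "e_d": 0.58})


def predict(model, scores):
    """Evaluate a linear predictor on a dictionary of observed scores."""
    intercept, weights = model
    return intercept + sum(weights[name] * scores[name] for name in weights)


def erater_share(model):
    """Fraction of the total weight that is placed on the two e-rater scores."""
    _, weights = model
    return (weights["e_g"] + weights["e_d"]) / sum(weights.values())


# Hypothetical examinee profile.
profile = {"h_g1": 3, "e_g": 3.2, "h_d1": 4, "e_d": 3.6}
u_hat = predict(U_HAT, profile)
w_hat = predict(W_HAT, profile)
share_u = erater_share(U_HAT)
share_w = erater_share(W_HAT)
```

With these coefficients, roughly a third of the total weight falls on e-rater under the scoring-accuracy criterion, compared with somewhat more than half under the reliability criterion.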

It should be noted that in any actual application of weighted scores, linking to the score S is 
required in order to preserve an appropriate reporting scale. 

4 Relationships to Other Section Scores 

It is helpful to examine the relationship of raw scores for Writing with the scaled scores for 
Reading, Listening, and Speaking. For this purpose, the main sample can be employed, and linear 
regressions on the three scores are appropriate. A summary of results is provided in Table 4. 
This table does not discriminate very much between different estimates, but it does suggest some 
weakness in $\hat{U}_3$, which does not use e-rater and uses only two human scores.



Table 4

Regressions of Scores on Writing on
Other Scaled Section Scores

Dependent variable     $R^2$
S                      0.65
$\hat{U}$              0.64
$\hat{U}_1$            0.63
$\hat{U}_2$            0.63
$\hat{U}_3$            0.60
$\hat{U}_4$            0.62
$\hat{U}_5$            0.64
$\hat{U}_6$            0.64
$\hat{U}_7$            0.64
$\hat{W}$              0.63



5 Conclusions 

The analysis does not provide entirely consistent conclusions, but it appears that the current
implementation of e-rater scoring has no obvious advantage over implementations that do not
require additional human scorers. Table 5 summarizes different criteria for performance of
alternative scoring systems. Note that this table adds a few analyses not previously described,
and note that $\hat{U}_2$ and $\hat{W}_3$ are equivalent, for both are linear functions of Q. Similar issues affect
$\hat{U}_4$, $\hat{U}_5$, $\hat{U}_6$, $\hat{U}_7$, $\hat{W}_6$, $\hat{W}_7$, $\hat{W}_8$, and $\hat{W}_9$. Scoring accuracy involves estimation of the score U, while
reliability analysis involves estimation of W.



Table 5

Coefficients of Determination of Scores Based on Alternate Criteria

Variable       Scoring accuracy   Repeaters   Reliability analysis   Other sections
S              0.91               0.53        0.74                   0.65
V                                 0.54                               0.64
$\hat{U}$      0.83               0.54        0.77                   0.64
$\hat{U}_1$    0.82               0.53        0.75                   0.63
$\hat{U}_2$    0.79               0.54        0.79                   0.63
$\hat{U}_3$    0.79               0.48        0.68                   0.60
$\hat{U}_4$    0.82               0.52        0.74                   0.62
$\hat{U}_5$    0.81               0.55        0.79                   0.64
$\hat{U}_6$    0.82               0.54        0.78                   0.64
$\hat{U}_7$    0.81               0.55        0.79                   0.64
$\hat{W}$      0.80               0.54        0.79                   0.63



The evenly weighted option $\hat{U}_2$ appears to be viable, although it exhibits some weakness in
terms of correlation with the actual human score. Options $\hat{U}$, $\hat{U}_5$, $\hat{U}_6$, $\hat{U}_7$, and $\hat{W}$ all appear
reasonable, and $\hat{U}_3$ is relatively weak on all criteria other than scoring accuracy. Option V,
when it can be evaluated, is comparable to options $\hat{U}$, $\hat{U}_5$, $\hat{U}_6$, $\hat{U}_7$, and $\hat{W}$; however, it is far more
expensive to employ due to the greatly increased use of human scoring. The choice of options
depends on the priorities assigned to the reliability analysis and the scoring analysis. The scoring
analysis makes fewer assumptions. The reliability analysis appears to provide reasonable results,
but its use of the repeater data for some calculations is problematic.

It is recognized that the TOEFL program may desire added human scoring in some cases in 
which e-rater and human scores appear discrepant, but such scoring should be minimized as much 
as feasible. Of course, double scoring of essays is needed to estimate rater reliability and to study 
other issues concerning rater behavior. 



References 



Attali, Y., & Burstein, J. (2005). Automated essay scoring with e-rater® v. 2.0 (ETS Research
Report No. RR-04-45). Princeton, NJ: ETS. 

Haberman, S. J. (2008). When can subscores have value? Journal of Educational and Behavioral 
Statistics, 33, 204-229. 

Haberman, S. J., & Qian, J. (2007). Linear prediction of a true score from a direct estimate and 
several derived estimates. Journal of Educational and Behavioral Statistics, 32, 6-23. 

Kelley, T. L. (1947). Fundamentals of statistics. Cambridge, MA: Harvard University Press. 

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA:
Addison-Wesley.

Mazzeo, J., Schmitt, A., & Cook, L. (1986a, April). The compatibility of adjudicated and
non-adjudicated essay scores. Paper presented at the annual meeting of the National
Council on Measurement in Education, San Francisco, CA.

Mazzeo, J., Schmitt, A., & Cook, L. (1986b). The compatibility of adjudicated and
non-adjudicated essay scores on the ATP English Composition Test with Essay. Unpublished
manuscript, Educational Testing Service, College Board Statistical Analysis, Princeton, NJ.

Wainer, H., Vevea, J. L., Camacho, F., Reeve, B. B., Swygert, K. A., & Thissen, D. (2001). 
Augmented scores — “Borrowing strength” to compute scores based on small numbers of 
items. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 343-387). Mahwah, NJ: 
Erlbaum. 


